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Abstract 


Models of complex systems are often formalized as sequential software simulators: compu¬ 
tationally intensive programs that iteratively build up probable system configurations given 
parameters and initial conditions. These simulators enable modelers to capture effects that 
are difficult to characterize analytically or summarize statistically. However, in many real- 
world applications, these simulations need to be inverted to match the observed data. This 
typically requires the custom design, derivation and implementation of sophisticated inver¬ 
sion algorithms. Here we give a framework for inverting a broad class of complex software 
simulators via probabilistic programming and automatic inference, using under 20 lines of 
probabilistic code. Our approach is based on a formulation of inversion as approximate in¬ 
ference in a simple sequential probabilistic model. We implement four inference strategies, 
including Metropolis-Hastings, a sequentialized Metropolis-Hastings scheme, and a particle 
Markov chain Monte Carlo scheme, requiring 4 or fewer lines of probabilistic code each. 

We demonstrate our framework by applying it to invert a real geological software simulator 
from the oil and gas industry. 

1. Introduction 

Sequential software simulators are used to model complex systems in fields ranging from 
geophysics (Symes et ah, 2011) to finance (Calvet and Fisher, 2007). They can capture 
dynamics that produce effects which are difficult or impossible to characterize analytically 
or to summarize statistically. However, the real-world problems faced by modelers often 
require inference, not just simulation. For example, prediction tasks require identifying real¬ 
izations of a simulation that are compatible with observed data. The problem of identifying 
probable realizations of a simulator given data is sometimes called simulator inversion. Both 
deterministic, optimization-based methods (Boschetti et ah, 1996), (Ramillien, 2001) and 
stochastic, sampling based (Malinverno, 2002), (Chen et ah, 2006) methods are sometimes 
applied. Applying a standard technique to a new simulator or developing a new method for 
an existing simulator requires developing and implementing custom algorithms. 

In this paper we show how to use probabilistic programming and automatic inference to 
formulate and solve a broad class of inversion problems. We define a simple interface to a 
sequential software simulator, and define a probabilistic model and approximate inference 
problem for inversion given that interface. This formulation requires under 20 lines of 
probabilistic code. We also describe four Monte Carlo inference strategies for solving the 
inversion problem, each requiring 4 or fewer lines of probabilistic code. We demonstrate 
our framework by applying it to invert a real geological software simulator from the oil and 
gas industry. 
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2. A framework for inverting sequential simulation software 


To invert sequential simulators, we define a probabilistic model over their parameters that 
encodes generic priors, and use approximate inference methods to infer probable values 
given data. 





1. So = Initialize (Ainzt) and 

Pinit(^So] Xinit^ ~ Pl'Ob_Init (sq > A-jnzt ) 

2. Uf — Sample (Aparams) and 

Pparams(ut’i Xparams) ~ Pi'Ob-Smp (iZ-t j Aparams ) 

3. St = Simulate(st-i,and 
Psim{st\st-i,ut) = Prob_Sim(st_i 

4. ot = Emit(st) and 
Pemit{ot]St) = Prob_Emt (ot ,st) 

5. kj{ot,r) = Comp_Distance_Likelihood(ot, r) 


Figure 1: Our framework for inverting sequential simulators, {left) A probabilistic graphical model 
that describes the inference problem corresponding to simulator inversion. Each slice corresponds 
to a step in the sequential simulation, capturing the dependence of the new state on the previous 
state and new input parameters. See main text for more details, {right) The procedural interface 
for specifying the simulator. 

We assume the simulator is Markovian; that is, at every time point t G {1,..., T}, we 
have a state variable st which only depends on the previous state st-i and the parameter(s) 
for the state Ui'. Si\si—i^Ui ^ Psim{^t—l-)'^t) and Ut\Xparams ^ Pparamsi^params) • For the 
initial state, we assume so\Xinit ^ PinitiKnit)- Moreover, at every t given the current 
state an emission ot is generated from a distribution, ot\st Pemit{st). To afford the 
flexibility for many different forms of observable data, we allow simulators to come with 
arbitrary per-step likelihood terms. These terms are incorporated by defining a Bernoulli 
distribution with parameter k^{ot^ r), where r is the real data and jo^, r, 7 
We provide an example of the kry{ot^r) in Section 2.3. 


2.1. Procedural interface for specifying sequential simulators 

This probabilistic model lends itself to a natural software interface that can be satisfied by 
many sequential simulators: 

1. So = Initiallze (Xinit) Sindpinit{so; Xinit) = Prob_Init (sq , : These procedures 

return the state of the simulator at initialization and compute the probability of sam¬ 
pling state 5o from the initializing distribution respectively. 

2. Uf = Sample (Apctrams) Pparamsi'^t] ^params) ~ Pi*ob_Smp , ApQ,^- 0 , 777 , 5 ): Thcsc pro¬ 

cedures sample the parameters for a time point ut and compute the probability of 
sampling. 

3. St = Simulate (st_i, Sind Psim{su = Prob_Sim( 5 t_i , 1 ^^): Given the cur¬ 

rent state of the simulator at time t and the parameter(s) for that time point, these 
procedures return the next state Sf and compute the probability of sampling. 


2 



INVERTING Sequential Simulators Using Probabilistig Programming 


4. ot = EmitCst) djid Pemit{ot] St) = Prob_Emit (o^, 5 ^): Given the current state at time 
t, these procedures emit the observation for that time point and computes the prob¬ 
ability of emission. 

5. k^{ot^r) = Comp_Distance_Likelihood(ot, r) : Given the observation and the real 
data, this procedure calculates the probability of having an observation at time t. 

2.2. A real-world example: inverting a geological forward model 

We focus on a simulator developed by an oil and gas company. In this model, the states 
{st) correspond to geological features called a lobes and the emissions (o^) correspond to the 
porosities of the substrate. The parameters ut for each state are n-tuples of independent 
uniform random variables, ut ^ Uni/[0, l]’^. The real data r is given by a set of L well logs^ 
or sequences of porosities at varying heights, each at a different location gi in the geological 
model. Figure 6 shows a generated sample from the geological simulator. In our dataset, 
well logs are available for L = 7 wells. The color of the lobe in renderings from the simulator 
represents the porosity at that lobe. See Figure 2 for a visualization of lobe formation and 
Figure 6 for the final output stratigraphy showing all 7 wells. 

The simulator builds a lobe st according to a complex geological model 4^ which given 
st-i and Ut is deterministic. The emission at a well location, denoted by ot/^ is a function 
$ of the current lobe st and location g£. We assume the initial state sq is a function of a 
hyperparameter Xinit and the emissions at different locations are independent. For every 
emission ot/ there will be a corresponding real well log for the same location and lobe 
rt,e- We set ot = , ot,L) and rj = {rt,i,rt,L)- 

The generative model for the observations at each lobe can be summarized as: 

So\XiYiit ^ Piniti^init) 

Ut ^ Unif[0, 1]"" 

St\st-l^Ut ^ 

Ot/\st ^ 

dt\ot,r,-f ^ Br{k^{ot,r)) 

The problem of stochastic inversion of a sequential simulator is finding the joint pos¬ 
terior of the states and the parameters given a sequence of real data r (i.e. finding 
p{ui:t^ si:t^oi^t\di:t) 1^ fhc modcl of Section 2). For the rest of this paper, we assume the 
model to be the geological model defined in this section. 

2.3. Distance based likelihood function 

In a complex simulator, the likelihood is intractable and an ABC filtering (Jasra et ah, 
2012 ) or ABC-MCMC (Marjoram et ah, 2003) methods may be useful. However, in the 
geological model (without the distance based likelihood) Pemit{ot\st) and Psim{st\st-i^ut) 
are delta distributions with a single atom. Thus, the likelihood is tractable. However, 

1. To be more precise, every emission ot^i at well is a sequence of porosities indexed by height h. The 
height at the end of lobe t at well i is denoted by endt^f, hence, the generated observation at well i and 
lobe t is given by • • •, T^endt,i,i)- Similarly, we can define the sequence of 

porosities for the real well logs. 
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Figure 2: Lobe formation in the model. (Left) The side view of stratigraphy at a sample well location 
for sequential sampling of the lobes (for lobes 1, 10, and 30). {Right) The generated well log for 
the same well. The x axis is the porosity value and the y axis is the height. 

the deterministic structure of the simulator can result in poor performance of any inference 
method. For instance, in an SMC scheme, Pemit{ot\st)Psim{st\st-i^ ut)^ which appears in the 
weight updating equations of particles, will be zero for all the particles with probability one. 
This explains the reason behind defining the likelihood in terms of the distance between the 
real data and the generated observation. 

Recall that, for lobe t and well we denote the generated data by Ot/ and the real data 
by Tt/. We assume that the error values for different locations are independent. For each 
location we use the following kernel to compute the error rt/) = exp{—'y\\ot/ — rt/\\) 

where 7 is the parameter of the kernel which controls the bandwidth. 

2.4. Probabilistic programming and automatic inference 

We use Venture (Mansinghka et ah, 2014) to represent our probabilistic model, including 
an interface to external simulation software. We use the automatic inference mechanisms in 
Venture to implement several strategies that can be applied to any simulator that satisfies 
the requirements of our interface. The Venture program we use for inverting all such 
simulators requires under 20 lines of probabilistic code. 

Figure 3 shows the probabilistic code for the model and four inference strategies. We 
focus on inversion strategies built out of the building blocks provided by Metropolis Hastings 
(MH) and particle Markov chain Monte Carlo methods (PMCMC) (Andrieu et ah, 2010). 
Each of the four strategies we present requires 4 or fewer lines of probabilistic code to 
implement. See (Mansinghka et ah, 2014) for details regarding the syntax and semantics of 
Venture. 

3. Experiments 

We report two experimental results. First, we compare the performance of particle Gibbs, 
Metropolis-Hastings and sequential Metropolis-Hastings. For all three methods, we set the 
kernel bandwidth 7 to 1. We use 10 lobes; each of these problem instances involves exploring 
a jagged 50-dimensional energy landscape. Figure 4 shows the results. On this problem, 
we find sequential Metropolis-Hastings to be the most effective, with particle Gibbs also 
exhibiting reasonable performance. Pure Metropolis-Hastings occasionally performs quite 
well but exhibits higher variance, frequently getting stuck in local minima. 
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Figure 3: A probabilistic program implementing our framework for inverting sequential sim¬ 
ulators. (Top code block) Venture code for the probabilistic model, using a single procedure sim 
to interface with external simulation software (in our experiments , via a Python to MATLAB 
link). (Middle left) The code for loading in observations (e.g. from well logs) and for running 
a particle Gibbs method for 50 iterations and 10 particles. (Middle right) Running single-site 
(random scan) MH for 500 iterations. (Bottom right) Sequential MH with lOt iterations over 
the first t observations. In this strategy, data incorporation is interleaved with inference, to 
incrementally account for strong sequential dependencies. (Bottom left) A hybrid method based 
on alternating particle Gibbs and single-site (random scan) MH. 


[assume sim (make_simulator) ] 

[assume get_init (lambda () (sim ’initialize)) ] 

[assume get_params (mem (lambda (t) (scope_include 0 t (make_array (uniform_continuous 0 1) 5)))) ] 
[assume get_state (mem (lambda (t) (sim ’simulate (get_params t) t (if (= t 0) 

(get_init) 

(get.state (- t 1))))))] 

[assume get_emission (mem (lambda (t) (sim ’emit (get_state t)))))] 

[assume get_distance (mem (lambda (t) (sim ’distance (get_emission t)))))] 


// Particle Gibbs 
for t = 1...T: 

[observe (log_flip (get_distance t)) true] 
[infer (pgibbs 0 ordered 10 50)] 


// Metropolis Hastings 
for t = 1...T: 

[observe (log_flip (get_distance t)) true] 
[infer (mh default one 500)] 


// Hybrid PGibbs-MH 
for t = 1...T: 

[observe (log_flip (get_distance t)) true] 
[infer (cycle ((pgibbs 0 ordered 10 10) 

(mh default one 50)) 10)] 


// Sequential MH 
for t = 1...T: 

[observe (log_flip (get_distance t)) true] 
for i == 1...t: 

[infer (mh 0 one 10)] 



Logscore Logscore 



Figure 4: Comparison of automatic inference methods on the geological simulator. We show 
histograms of log probability for 30 independent runs of each method. We also show the trajec¬ 
tories taken by the Metropolis-Hastings method. 


The three methods were run for roughly equivalent numbers of iterations (500), and 
their runtimes were all within a factor of 2 of each other. This is due to the runtime being 
dominated by calls to the simulator, and one iteration resulting in about 5-10 simulator 
calls for all three methods. 

Figure 5 shows typical results for a larger-scale experiment searching over 80 lobes (400 
dimensions). Many complex features of the well logs are captured by our method, although 
some wells are only poorly explained. Based on discussions with geologists, these results 
are comparable in quality to those obtained via a custom optimization-based baseline. It 
thus may be possible to improve accuracy as well as reduce code complexity by developing 
a more sophisticated inference scheme that can be automatically applied to this broad class 
of inversion problems. 


4. Discussion 

These preliminary results show that it is possible to use a general-purpose probabilistic 
programming system with only automatic, general-purpose inference mechanisms to invert 
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(a) PGibbs - Well #4 (b) PGibbs - Well #5 (c) PGibbs - Well #6 

Figure 5: Typical large-scale (80 lobe) well log fits. (a,h,c) Inverted (green) and true (black) well 
logs obtained using particle Gibbs with 250 particles and 2 transitions; results correspond to the 
particle with highest weight. 


sophisticated software simulators. A 10-line probabilistic program suffices. Multiple infer¬ 
ence strategies can be specified with 4 or fewer lines, and can be compared to produce an 
ensemble of probable inversions. If the underlying simulator is changed, the only change to 
the probabilistic program that is needed is to generate the appropriate random parameters 
per simulation step. The inference programs do not need to be changed at all, even though 
the transition operators they induce may be quite different. 



Figure 6: Final output stratiagraphy, showing the location of all 7 wells and many of the lobes. 

Future work will investigate more sophisticated models and automatic, general-purpose 
inference schemes, as well as applications to other simulators. It would be especially inter¬ 
esting to address statistical issues in inversion, for example by augmenting our simulator 
interface to expose and label parameters that affect model complexity or adjust the res¬ 
olution of the data, and using model selection and parameter estimation to adjust them 
appropriately. 
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